80  Standardisation and Normalisation in ML

80.1 Introduction

In machine learning, it’s common to scale data, especially predictor variables (independent variables, IVs), so that features measured in different units contribute comparably to a model and are easier to compare when interpreting it.

Standardisation and normalisation both rescale the data, but they do it in different ways, and the choice can significantly affect the performance of ML algorithms.

80.2 Standardisation

  • Standardisation involves shifting and rescaling each variable so that it has a mean of 0 and a standard deviation of 1.

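  • It’s especially useful for algorithms that are sensitive to the scale of the predictors, such as regularised linear or logistic regression. Note that standardising changes a variable’s location and spread, not its shape, so it does not make skewed data normally distributed.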

  • We can use the scale() function in R to standardise our data. By default it subtracts each column’s mean and divides by its standard deviation.
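To make the definition concrete, here is a minimal sketch that computes the z-score by hand and compares it with the default behaviour of scale(). The vector x and its parameters are made up for illustration.

```r
# Sample data, assumed for illustration only
set.seed(42)
x <- rnorm(100, mean = 50, sd = 10)

# Manual z-score: subtract the mean, divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; as.numeric() drops that structure
z_scale <- as.numeric(scale(x))

# Both approaches should agree up to floating-point error
all.equal(z_manual, z_scale)

# The result has a mean of (approximately) 0 and a standard deviation of 1
mean(z_scale)
sd(z_scale)
```

Note that sd() uses the sample (n - 1) formula, which is also what scale() uses when centring is on.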

An example for one variable

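# Load necessary library (gridExtra is called below via gridExtra::)
library(ggplot2)

# Create sample data
data <- data.frame(Score = rnorm(100, mean = 50, sd = 10))

# Standardise a vector in the dataset
# scale() returns a one-column matrix, so convert it back to a plain vector
data$Standardised_Score <- as.numeric(scale(data$Score))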

# Plot original vs. standardised data
p1 <- ggplot(data, aes(x = Score)) +
  geom_histogram(binwidth = 1, fill = "blue", alpha = 0.7) +
  ggtitle("Original Data")

p2 <- ggplot(data, aes(x = Standardised_Score)) +
  geom_histogram(binwidth = 0.1, fill = "green", alpha = 0.7) +
  ggtitle("Standardised Data, mean = 0 and sd = 1")

gridExtra::grid.arrange(p1, p2, ncol = 2)

An example where all predictors are scaled

# Create sample dataframe
data <- data.frame(
  Variable1 = rnorm(100, mean = 20, sd = 5),
  Variable2 = runif(100, min = 10, max = 50),
  Variable3 = rnorm(100, mean = 0, sd = 1),
  Variable4 = rbinom(100, size = 10, prob = 0.5),
  Variable5 = runif(100, min = 0, max = 100)  # This variable will not be scaled
)

# Scale first four variables by creating another dataframe
data_scaled <- as.data.frame(lapply(data[1:4], scale))

# Add the unscaled Variable5 back into the scaled dataframe
data_scaled$Variable5 <- data$Variable5

# Display the first rows of the scaled data
head(data_scaled)
    Variable1  Variable2  Variable3  Variable4 Variable5
1 -0.71304802 -0.8402942  0.8265119 -0.6624077  23.72297
2 -0.35120270  1.6182640  0.8065105 -0.6624077  68.64904
3  1.60854170  0.3917819  0.3391857 -0.6624077  22.58184
4 -0.02179795  0.0984535 -1.0949469 -1.2932721  31.84946
5  0.04259548 -0.2836195 -0.1439886 -0.6624077  17.39838
6  1.77983218  1.3392854 -0.3161629 -0.6624077  80.14296
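As a quick check on a multi-column result, the sketch below scales a small made-up data frame in one call, confirms the column means and standard deviations, and uses the centring and scaling values that scale() stores as attributes to undo the transformation. The data frame df and its columns a and b are assumptions for illustration.

```r
# Made-up data frame with two numeric columns on different scales
set.seed(1)
df <- data.frame(
  a = rnorm(100, mean = 20, sd = 5),
  b = runif(100, min = 10, max = 50)
)

# scale() accepts several numeric columns at once and returns a matrix
scaled <- scale(df)

# Column means are approximately 0 and column standard deviations are 1
colMeans(scaled)
apply(scaled, 2, sd)

# The values used for centring and scaling are kept as attributes
centre <- attr(scaled, "scaled:center")
spread <- attr(scaled, "scaled:scale")

# Undo the transformation: multiply by the spread, then add the centre back
recovered <- sweep(sweep(scaled, 2, spread, "*"), 2, centre, "+")

# Largest absolute difference from the original values (should be tiny)
max(abs(recovered - as.matrix(df)))
```

Keeping these attributes is handy if you later need to apply the same transformation to new observations or report results on the original scale.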

80.3 Normalisation

  • Normalisation adjusts the scale of the data so that the range is between 0 and 1.

  • This is useful for algorithms that compute distances between data points, like K-Nearest Neighbors (KNN) and K-Means clustering.

  • We can use ‘min-max scaling’ in R to achieve this: subtract the variable’s minimum from each value, then divide by its range (maximum minus minimum).

# using the same dataset created above
# normalise data using min-max 
data$Normalised_Variable1 <- (data$Variable1 - min(data$Variable1)) / (max(data$Variable1) - min(data$Variable1))

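# plot original vs. normalised
# (p1 was built from the Score data, so plot Variable1 itself for comparison)
p_orig <- ggplot(data, aes(x = Variable1)) +
  geom_histogram(binwidth = 1, fill = "blue", alpha = 0.7) +
  ggtitle("Original Data")

p3 <- ggplot(data, aes(x = Normalised_Variable1)) +
  geom_histogram(binwidth = 0.02, fill = "red", alpha = 0.7) +
  ggtitle("Normalised Data")

gridExtra::grid.arrange(p_orig, p3, ncol = 2)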
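If you need to normalise several columns, wrapping the min-max formula in a small function avoids repeating it. The sketch below is one way to do that; min_max is a hypothetical helper (not a base R function) and the data frame is made up for illustration.

```r
# Hypothetical helper: rescale a numeric vector to the range [0, 1]
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    # A constant column has zero range; return zeros rather than dividing by 0
    return(rep(0, length(x)))
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

# Made-up data with two columns on different scales
set.seed(7)
df <- data.frame(
  Variable1 = rnorm(100, mean = 20, sd = 5),
  Variable2 = runif(100, min = 10, max = 50)
)

# Apply the helper column by column, keeping a data frame
df_norm <- as.data.frame(lapply(df, min_max))

# Each column now runs from 0 to 1
sapply(df_norm, range)
```

The guard for constant columns is a design choice made here; you may prefer to drop such columns or raise an error instead.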

80.4 When to use

The choice between standardisation and normalisation depends on the characteristics of your data and the requirements of the algorithm you’re using.

We typically standardise data when dealing with features that have a roughly Gaussian (bell curve) distribution. Standardisation is important for models that work best when all features are centred around zero and have variances of the same order of magnitude, such as:

  • Linear Regression
  • Logistic Regression
  • Support Vector Machines
  • Principal Component Analysis (PCA)
  • Algorithms that compute distances or assume normality

Normalisation rescales the data into a fixed range, usually [0, 1] or [-1, 1]. It’s useful when you need to scale the features so they’re in a bounded interval.
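Rescaling to an interval other than [0, 1] only needs one extra step: map to [0, 1] first, then stretch and shift. A minimal sketch, where rescale_to is a hypothetical helper written for this example:

```r
# Hypothetical helper: map x linearly onto the interval [lower, upper]
rescale_to <- function(x, lower = 0, upper = 1) {
  unit <- (x - min(x)) / (max(x) - min(x))  # first map to [0, 1]
  lower + unit * (upper - lower)            # then stretch and shift
}

x <- c(2, 4, 6, 10)
rescale_to(x)         # 0.00 0.25 0.50 1.00
rescale_to(x, -1, 1)  # -1.0 -0.5 0.0 1.0
```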

Normalisation is often the choice for models that are sensitive to the magnitude of values and where you don’t assume any specific distribution of features, such as:

  • Neural Networks
  • k-Nearest Neighbors (k-NN)
  • k-Means Clustering
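  • Situations where you need to maintain zero entries in sparse data (min-max scaling keeps zeros at zero only when a feature’s minimum is 0, whereas the centring step in standardisation shifts them).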
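The point about zero entries can be checked on a small example. The vector below is made up for illustration and assumes a non-negative feature whose minimum is 0, which is the case where min-max scaling leaves zeros untouched.

```r
# Illustrative sparse, non-negative feature (mostly zeros)
x <- c(0, 0, 0, 5, 0, 10, 0, 0)

# Min-max scaling: the minimum is 0, so zeros stay at 0
x_norm <- (x - min(x)) / (max(x) - min(x))
sum(x_norm == 0)  # 6 zeros remain

# Standardisation subtracts the mean (1.875 here), so no entry stays at 0
x_std <- as.numeric(scale(x))
sum(x_std == 0)   # 0
```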

In practice, I’d suggest that you try both methods as part of your exploratory data analysis to determine which scaling technique works better for your specific model and dataset.
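One way to try both methods is sketched below, using the built-in iris data and k-means clustering from base R. The choice of dataset, the cluster_purity helper, and the "purity" measure (the share of observations that fall in their cluster’s majority species) are all assumptions made for illustration; for your own model you would substitute whatever performance measure fits it.

```r
# Built-in iris data: four numeric features plus a known species label
features <- iris[, 1:4]

# The two scalings discussed in this chapter
standardised <- as.data.frame(scale(features))
normalised <- as.data.frame(lapply(features, function(x) {
  (x - min(x)) / (max(x) - min(x))
}))

# Hypothetical helper: cluster with k-means, then report the share of
# observations that fall in their cluster's majority label ("purity")
cluster_purity <- function(df, labels, k = 3) {
  set.seed(123)  # k-means starts from random centres
  fit <- kmeans(df, centers = k, nstart = 25)
  counts <- table(fit$cluster, labels)
  sum(apply(counts, 1, max)) / length(labels)
}

# Compare unscaled, standardised and normalised versions of the same data
results <- c(
  unscaled     = cluster_purity(features, iris$Species),
  standardised = cluster_purity(standardised, iris$Species),
  normalised   = cluster_purity(normalised, iris$Species)
)
round(results, 3)
```

Whichever version scores best here says nothing about other datasets, which is exactly why it’s worth running this kind of comparison on your own data.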